Linear-Time Rule Induction
Author
Abstract
The recent emergence of data mining as a major application of machine learning has led to increased interest in fast rule induction algorithms. These are able to efficiently process large numbers of examples, under the constraint of still achieving good accuracy. If e is the number of examples, many rule learners have O(e^4) asymptotic time complexity in noisy domains, and C4.5RULES has been empirically observed to sometimes require O(e^3). Recent advances have brought this bound down to O(e log^2 e), while maintaining accuracy at the level of C4.5RULES's. In this paper we present CWS, a new algorithm with guaranteed O(e) complexity, and verify that it outperforms C4.5RULES and CN2 in time, accuracy and output size on two large datasets. For example, on NASA's space shuttle database, running time is reduced from over a month (for C4.5RULES) to a few hours, with a slight gain in accuracy. CWS is based on interleaving the induction of all the rules and evaluating performance globally instead of locally (i.e., it uses a "conquering without separating" strategy as opposed to a "separate and conquer" one). Its bias is appropriate to domains where the underlying concept is simple and the data is plentiful but noisy.

Introduction and Previous Work

Very large datasets pose special problems for machine learning algorithms. A recent large-scale study found that most algorithms cannot handle such datasets in a reasonable time with reasonable accuracy (Michie, Spiegelhalter, & Taylor 1994). However, in many areas (including astronomy, molecular biology, finance, retail, and health care) large databases are now the norm, and discovering patterns in them is a potentially very productive enterprise in which interest is rapidly growing (Fayyad & Uthurusamy 1995). Designing learning algorithms appropriate for such problems has thus become an important research problem.
In these "data mining" applications, the main consideration is typically not to maximize accuracy, but to extract useful knowledge from a database. The learner's output should still represent the database's contents with reasonable fidelity, but it is also important that it be comprehensible to users without machine learning expertise. "If ... then ..." rules are perhaps the most easily understood of all representations currently in use, and they are the focus of this paper.

A major problem in data mining is that the data is often very noisy. Besides making the extraction of accurate rules more difficult, this can have a disastrous effect on the running time of rule learners. In C4.5RULES (Quinlan 1993), a system that induces rules via decision trees, noise can cause running time to become cubic in e, the number of examples (Cohen 1995). When there are no numeric attributes, C4.5, the component that induces decision trees, has complexity O(ea^2), where a is the number of attributes (Utgoff 1989), but its running time in noisy domains is dwarfed by that of the conversion-to-rules phase (Cohen 1995). Outputting trees directly has the disadvantage that they are typically much larger and less comprehensible than the corresponding rule sets. Noise also has a large negative impact on windowing, a technique often used to speed up C4.5/C4.5RULES for large datasets (Catlett 1991). In algorithms that use reduced error pruning as the simplification technique (Brunk & Pazzani 1991), the presence of noise causes running time to become O(e^4 log e) (Cohen 1993). Fürnkranz and Widmer (1994) have proposed incremental reduced error pruning (IREP), an algorithm that prunes each rule immediately after it is grown, instead of waiting until the whole rule set has been induced. Assuming the final rule set is of constant size, IREP reduces running time to O(e log^2 e), but its accuracy is often lower than C4.5RULES's (Cohen 1995).
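The "separate and conquer" strategy that the abstract contrasts CWS against, and that IREP and its relatives follow, can be sketched as a loop that grows one rule at a time and removes ("separates") the examples it covers before growing the next. The sketch below is an illustration of that general strategy only, not the pseudocode of any of the cited systems; the rule representation, the precision-based scoring, and the assumption that all examples share the same attributes are simplifications introduced here.

```python
# Minimal separate-and-conquer skeleton (illustrative only).
# A rule is a dict of {attribute: value} tests predicting the positive class.
# Examples are (attribute-dict, label) pairs; labels are True/False.

def rule_covers(rule, example):
    """A rule covers an example if every one of its tests is satisfied."""
    attrs, _ = example
    return all(attrs.get(a) == v for a, v in rule.items())

def grow_rule(examples, attributes):
    """Greedily add the single test that most improves precision."""
    rule = {}
    while True:
        covered = [ex for ex in examples if rule_covers(rule, ex)]
        pos = sum(1 for _, label in covered if label)
        if not covered or pos == len(covered):
            return rule  # rule is pure, or covers nothing
        best, best_prec = None, pos / len(covered)
        for a in attributes:
            if a in rule:
                continue
            for v in {attrs[a] for attrs, _ in covered}:
                cand = dict(rule, **{a: v})
                cov = [ex for ex in covered if rule_covers(cand, ex)]
                p = sum(1 for _, label in cov if label)
                if cov and p and p / len(cov) > best_prec:
                    best, best_prec = cand, p / len(cov)
        if best is None:
            return rule  # no test improves precision further
        rule = best

def separate_and_conquer(examples, attributes):
    """Induce rules one at a time, removing covered examples after each."""
    rules = []
    remaining = list(examples)
    while any(label for _, label in remaining):
        rule = grow_rule(remaining, attributes)
        covered = [ex for ex in remaining if rule_covers(rule, ex)]
        if not any(label for _, label in covered):
            break  # cannot make further progress
        rules.append(rule)
        remaining = [ex for ex in remaining if not rule_covers(rule, ex)]
    return rules
```

Each pass over the shrinking example set is what makes the strategy local: a rule is evaluated only against the examples left when it is grown, which is precisely the behavior CWS's global, interleaved evaluation is designed to avoid.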
Cohen introduced a number of modifications to IREP, and verified empirically that RIPPERk, the resulting algorithm, is competitive with C4.5RULES in accuracy, while retaining an average running time similar to IREP's (Cohen 1995). Catlett (1991) has done much work in making decision tree learners scale to large datasets. A preliminary empirical study of his peepholing technique shows that it greatly reduces C4.5's running time without significantly affecting its accuracy. To the best of our knowledge, peepholing has not been evaluated on any large real-world datasets, and has not been applied to rule learners. A number of algorithms achieve running time linear in e by forgoing the greedy search method used by the learners above, in favor of exhaustive or pruned near-exhaustive search (e.g., Weiss, Galen, & Tadepalli 1987; Smyth, Goodman, & Higgins 1990; Segal & Etzioni 1994). However, this causes running time to become exponential in a, leading to a very high cost per example, and making application of those algorithms to large databases difficult. Holte's 1R algorithm (Holte 1993) outputs a single tree node, and is linear in a and O(e log e), but its accuracy is often much lower than C4.5's. Ideally, we would like an algorithm capable of inducing accurate rules in time linear in e, without becoming too expensive in other factors. This paper describes such an algorithm and its empirical evaluation. The algorithm is presented in the next section, which also derives its worst-case time complexity. A comprehensive empirical evaluation of the algorithm is then reported and discussed.
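Holte's 1R, mentioned above, illustrates how cheap rule induction becomes when the hypothesis space is restricted to rules over a single attribute: pick the attribute whose value-to-majority-class mapping misclassifies the fewest training examples. The following is a minimal sketch of that idea, simplified relative to Holte (1993): it handles only nominal attributes and ignores missing values and numeric discretization.

```python
from collections import Counter, defaultdict

def one_r(examples):
    """1R sketch: for each attribute, map each of its values to the majority
    class among examples with that value; keep the attribute whose rules
    misclassify the fewest training examples.

    examples: list of (attribute-dict, class-label) pairs, all sharing
    the same nominal attributes.
    """
    best_attr, best_rules, best_errors = None, None, None
    attributes = examples[0][0].keys()
    for attr in attributes:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for attrs, label in examples:
            counts[attrs[attr]][label] += 1
        # Majority class per value, and total misclassifications.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules
```

A single pass per attribute gives the linear-in-a cost noted above; the O(e log e) term in Holte's analysis comes from sorting during discretization of numeric attributes, which this nominal-only sketch omits.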
Similar Resources
A General Rule for the Influence of Physical Damping on the Numerical Stability of Time Integration Analysis
The influence of physical damping on the numerical stability of time integration analysis has been an open question for decades. In this paper, it is shown that, under specific, very general conditions, physical damping can be disregarded when studying numerical stability. It is also shown that, provided these conditions are met, analysis of structural systems involved in extremely hi...
Full text

Lightweight Rule Induction
A lightweight rule induction method is described that generates compact Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. The induction method attempts to minimize the training error with no pruning. An overall design is specified by setting...
Full text

Fast Discovery of Sim...
The recent emergence of data mining as a major application of machine learning has led to increased interest in fast rule induction algorithms. These are able to efficiently process large numbers of examples, under the constraint of still achieving good accuracy. If e is the number of examples, many rule learners have O(e^4) asymptotic time complexity in noisy domains, and C4.5RULES has been emp...
Full text

Speed Control of Induction Motor using Fuzzy Rule Base
Induction motors are characterized by complex, highly nonlinear and time-varying dynamics, and hence their speed control is a challenging engineering problem. The advent of vector control techniques has partially solved induction motor control problems, but they are sensitive to drive parameter variations and performance may deteriorate if conventional controllers are used. By exploiting th...
Full text

Efficient Specific-to-General Rule Induction
RISE (Domingos 1995; in press) is a rule induction algorithm that proceeds by gradually generalizing rules, starting with one rule per example. This has several advantages compared to the more common strategy of gradually specializing initially null rules, and has been shown to lead to significant accuracy gains over algorithms like C4.5RULES and CN2 in a large number of application domains. Howe...
Full text

A Hyper-Heuristic for Descriptive Rule Induction
Rule induction from examples is a machine learning technique that finds rules of the form condition → class, where condition and class are logic expressions of the form variable1 = value1 ∧ variable2 = value2 ∧ ... ∧ variablek = valuek. There are in general three approaches to rule induction: exhaustive search, divide-and-conquer, and separate-and-conquer (or its extension as weighted covering). ...
Full text